On 4th of December, Tom Lane committed really cool patch:
KNNGIST, otherwise known as order-by-operator support for GIST.
This commit represents a rather heavily editorialized version of Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1 patches. I redid the opclass API to add a separate Distance method instead of turning the Consistent method into an illogical mess, fixed some bit-rot in the rbtree interfaces, and generally worked over the code style and comments. There's still no non-code documentation to speak of, but I'll work on that separately. Some contrib-module changes are also yet to come (right now, point <-> point is the only KNN-ified operator). Teodor Sigaev and Tom Lane
The biggest problem with describing it is that is “just" adds some additional features that are not really visible from the SQL point of view, until someone will use it.
Luckily – very soon (same day) Tom Lane committed second patch, which adds KNNGIST usage to pg_trgm, so I have a way to show you how cool it is.
First, because I will be using pg_trgm in this example – you should know what it is. If you don't know – it is a way to match words using trigrams.
Simple example:
$ CREATE TABLE words ( word text ); CREATE TABLE $ copy words FROM '/usr/share/dict/american-english-insane'; COPY 638645
Data look like this:
$ SELECT * FROM words ORDER BY random() LIMIT 10; word ----------------- heterograft's totipalmation's Cretaceous Snoquamish's inca hydremia's slovenlinesses Lysite Jud's Aquebogue's (10 ROWS)
Trigrams look like:
$ SELECT show_trgm('depesz'); show_trgm ------------------------------------- {" d"," de",dep,epe,esz,pes,"sz "} (1 ROW)
With GiST indexes, I can find all words similar to given word:
$ CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops); CREATE INDEX
and now:
$ SELECT word, similarity(word, 'depesz') AS sml FROM words WHERE word % 'depesz' ORDER BY sml DESC, word LIMIT 10; word | sml ---------+---------- depe | 0.5 depel | 0.444444 Depew | 0.444444 DePew | 0.444444 depend | 0.4 Depere | 0.4 deperm | 0.4 deepest | 0.363636 depeach | 0.363636 depends | 0.363636 (10 ROWS)
This is all sweet, but the problem lies in how the results are obtained:
$ EXPLAIN analyze SELECT word, similarity(word, 'ing') AS sml FROM words WHERE word % 'ing' ORDER BY sml DESC, word LIMIT 10; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- LIMIT (cost=1650.25..1650.28 ROWS=10 width=10) (actual TIME=105.885..105.885 ROWS=10 loops=1) -> Sort (cost=1650.25..1651.85 ROWS=639 width=10) (actual TIME=105.884..105.884 ROWS=10 loops=1) Sort KEY: (similarity(word, 'ing'::text)), word Sort Method: top-N heapsort Memory: 25kB -> Bitmap Heap Scan ON words (cost=29.44..1636.45 ROWS=639 width=10) (actual TIME=104.076..105.664 ROWS=645 loops=1) Recheck Cond: (word % 'ing'::text) -> Bitmap INDEX Scan ON trgm_idx (cost=0.00..29.28 ROWS=639 width=0) (actual TIME=104.037..104.037 ROWS=645 loops=1) INDEX Cond: (word % 'ing'::text) Total runtime: 105.922 ms (9 ROWS)
Generally – PostgreSQL has to fetch all rows matching our criteria, sort it, and then get top 10.
But. With KNNGIST – PostgreSQL can get the rows already sorted in required order. Which means there is less rows to check in table data, and we can skip the sort. And it looks like this:
$ EXPLAIN analyze SELECT word, similarity(word, 'ing') AS sml FROM words WHERE word % 'ing' ORDER BY word <-> 'ing' LIMIT 10; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- LIMIT (cost=0.00..37.13 ROWS=10 width=10) (actual TIME=27.577..91.367 ROWS=10 loops=1) -> INDEX Scan USING trgm_idx ON words (cost=0.00..2372.46 ROWS=639 width=10) (actual TIME=27.575..91.362 ROWS=10 loops=1) INDEX Cond: (word % 'ing'::text) ORDER BY: (word <-> 'ing'::text) Total runtime: 91.406 ms (5 ROWS)
Time difference is not big, but the table I am using is not really impressive – ~ 600k words only.
We can also try to use it with geometry queries:
$ CREATE TABLE test ( POSITION point ); CREATE TABLE
Table created. Now let's insert some random points:
$ INSERT INTO test (POSITION) SELECT point( random() * 1000, random() * 1000) FROM generate_series(1,1000000); INSERT 0 1000000
1 million points should be enough for my example. All of them have both X and Y in range <0, 1000). Now we just need the index:
$ CREATE INDEX q ON test USING gist ( POSITION ); CREATE INDEX
And we can find some rows close to center of the points cloud:
$ SELECT *, POSITION <-> point(500,500) FROM test ORDER BY POSITION <-> point(500,500) LIMIT 10; POSITION | ?COLUMN? -------------------------------------+------------------- (499.965638387948,499.452529009432) | 0.548548271254899 (500.473062973469,500.450353138149) | 0.65315122744144 (500.277776736766,500.743471086025) | 0.793668174518778 (499.986605718732,500.844359863549) | 0.844466095200968 (500.858531333506,500.130807515234) | 0.868439207229501 (500.96702715382,499.853323679417) | 0.978087654172406 (500.975443981588,500.170825514942) | 0.990289007195055 (499.201623722911,499.368405900896) | 1.01799596553335 (498.899147845805,500.683960970491) | 1.29602394829404 (498.38217580691,499.178630765527) | 1.81438764851559 (10 ROWS)
And how about speed?
$ EXPLAIN analyze SELECT *, POSITION <-> point(500,500) FROM test ORDER BY POSITION <-> point(500,500) LIMIT 10; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- LIMIT (cost=0.00..0.77 ROWS=10 width=16) (actual TIME=0.164..0.475 ROWS=10 loops=1) -> INDEX Scan USING q ON test (cost=0.00..76512.60 ROWS=1000000 width=16) (actual TIME=0.163..0.473 ROWS=10 loops=1) ORDER BY: ("position" <-> '(500,500)'::point) Total runtime: 0.505 ms (4 ROWS)
Now, that's something really cool. Great work!
The patch is great!
Try to increase SIGLENINT in trgm.h. I beleive that 3 in too small for effective search.
Whoa. Anybody seen my jaw? I think I dropped it somewhere around here.
Is it on purpose that the trigram query with and without the index are not on the same literal?
Wow, will we be able to use this to improve the performance of finding zipcodes nearby?
@Mark:
yes. As long as you will store also geographical location per zip code.
@depesz: great! From reading closely, it sounds like we have to wait for involved operators to be “KNNifed”. In my case that’s the “@” operator for the cube type. Hopefully that’s not too hard for someone.
Although, perhaps due to this approach some other method of calculating zipcode distances will be faster than cube search method.
Thanks for the interesting post highlighting this new functionality? Do you know if it is possible to use this extension for nearest neighbour searches on 3D points (in my case neurons in the brain)? As far as I am aware the point type is 2D only.
@Greg:
as far as I understand, this functionality will work as long as you’ll have point3d datatype. which (afaik) you don’t have currently. I am not aware of any 3d-related datatypes and their index support.